Designed active-site library reveals thousands of functional GFP variants

Mutations in a protein active site can lead to dramatic and useful changes in protein activity. The active site, however, is sensitive to mutations due to a high density of molecular interactions, substantially reducing the likelihood of obtaining functional multipoint mutants. We introduce an atomistic and machine-learning-based approach, called high-throughput Functional Libraries (htFuncLib), that designs a sequence space in which mutations form low-energy combinations that mitigate the risk of incompatible interactions. We apply htFuncLib to the GFP chromophore-binding pocket, and, using fluorescence readout, recover >16,000 unique designs encoding as many as eight active-site mutations. Many designs exhibit substantial and useful diversity in functional thermostability (up to 96 °C), fluorescence lifetime, and quantum yield. By eliminating incompatible active-site mutations, htFuncLib generates a large diversity of functional sequences. We envision that htFuncLib will be used in one-shot optimization of activity in enzymes, binders, and other proteins.

Briefly described (if I understand the method correctly), the application to GFP starts with manually selecting 27 sites around the chromophore assumed to be important for function and followed by these steps: A first filtering is based on a phylogenetic analysis and single-site Rosetta calculations and results in a list of promising amino acid substitutions at a subset of the 27 sites. Then, Rosetta calculations of multipoint mutations constructed in spatially local neighbourhoods are used as input for a neural network model that outputs a ranked list of substitutions from which a selected top is used to construct a combinatorial library. Two such libraries were made and screened for fluorescence in high throughput and for two excitation wavelengths. Screened variants from the two libraries were pooled and a common random forest model was trained to predict the functional state of all 11 plus 0.93 million combinatorial variants of the two libraries respectively. The usefulness of individual substitutions in functional multipoint mutants is analysed based on the predicted function states.
Overall, I find the paper interesting and it appears that the method can make combinatorial libraries that are enriched in successful and potentially interesting functional variants. Although it is not surprising to find interactions between buried and spatially close sites, the paper presents a method that seem capable of handling this, i.e. that a substantial fraction of the variants generated are functional. How many and what fraction is still unclear (see below for discussion on this). One area to improve the manuscript is that the computational method, details of the highthroughput screen and random forest model are relatively poorly described, and in places also poorly validated, which leaves substantial uncertainties for the reader, and makes it very difficult to evaluate the method presented.
More detailed comments and suggestions for improvements.

1.
There are some claims in the abstract that are difficult to find support for in the paper: 1a. "We screened 11 million htFuncLib designs that targeted the GFP chromophore-binding pocket …". The experimentally screened library in theory has 11 million members but I do not see evidence that all have been transformed and screened. Fig. 2C suggests a maximum of ~10^5 experimentally screened variants. If the abstract refers to the in silico screen using the random forest model, please make that clear, and also say how many are known to have been screened experimentally.
1b. "… and isolated >16,000 unique fluorescent designs encoding as many as eight active-site mutations". In the body of the paper the 16,000 functional variants are referred to as "putative" or "potentially active" which I find more reasonable due to the "Relatively loose criteria (enrichment in the selected population relative to the presorted population >1)" by which these are identified. If the authors wish to claim to have identified 16,000 functional variants, the rest of the paper should be more focused on validating this, see in particular below regarding the uncertainty of the high-throughput assay.
1c. "By eliminating incompatible active-site mutations, htFuncLib generates a greater diversity of functional sequences than evolutionary or mutational scanning approaches for optimizing enzymes, binders, and other functional proteins." Fig. 3B illustrates that when only looking at variants containing functional site mutations, i.e. removing most variants from the other sets, then htFuncLib seems to cover a larger space. However, it is not clear from the abstract that the other sets are heavily reduced, in particular because variants with more mutations are likely to have at least one outside the functional site. The text in the conclusion is a bit clearer on this point, but it would be good that the abstract reflects this.

2)
Major parts of the methods are not described in any substantial detail. For example the deep sequencing and random-forest model which are described with six and three sentences respectively. This makes it almost impossible to judge the results and conclusions of the paper, and for others to build on this work. Ideally, one should be able to examine all major parts from reading the paper, for example, only details should by looked up in the code.
3) I find it surprising that PROSS-eGFP performs relatively poorly in the computational and experimental stability evaluation (Fig. 5a). If I understand correctly, PROSS-eGFP should be optimized by something very similar to the filtering method so why does so many beneficial substitutions show up in the filtering? The only information I can find regarding the "filtering" is that this leaves a combinatorial space of 10^18 variants; it would be interesting to see the resulting substitutions presented as in supplementary table S1. Also, Y145F is described in ref. 33 as a part of PROSS-eGFP but here reintroduced as a new mutation? Similarly, T167I is presented as a new substitution here, but seems to already be in the background in ref. 33 (looking at the DNA sequence below the supplementary table S2). Also, in ref. 33 PROSS-eGFP is reported to have 11 or 12 substitutions relative to eGFP but here only one (interpreting undescribed tics in Fig.  5A and the caption of Fig. 3: "PROSS-eGFP (and eGFP, which are nearly identical in the designed positions)". All in all, this is somewhat unclear from reading the paper. Maybe it is just me having a hard time finding the data, but then that might also be the case for other readers. It would have been useful to report exact sequences of eGFP and PROSS-eGFP (and sfGFP) and substitutions used. Also, please comment on the apparent low stability of PROSS-eGFP and the relationship to the fact that it is optimized.

4)
Regarding the number of tested designs, I cannot make the numbers in supplementary table S5 match the attached csv file: The csv holds 72 sequences incl. 3 controls, i.e. 69 tested designs that all seems to be functional, whereas the text reports 68 designs tested in total? There may be a way to add this up but I could not find it. The 19 variants in sfGFP background are listed in the csv but not in table S5. Also, I can only see 15 functional random-forest designs and 10 AmCyan variants selected for spectral shift in the csv but 16 and 14 are reported in table S5? Most importantly, please report the sequences and number of mutations of the failed design (e.g. in the csv and/or in the text) as this is essential information for the community. Also, please describe in the paper the nine "Additional designs" listed in Table S5 as selected from deep sequencing, of which only one is functional. Again, apologies if these were presented somewhere, but I could not find it.

5)
In the top plot of Fig 3A, it seems that 100% of single variants are functional -perhaps this is WT, i.e. zero mutations, and the x-axis is shifted? If this is the case, the RF model seems very optimistic in identifying function with almost 100% single mutants functional (currently x=2) or at least more than NGS. With this, I find it odd that the bottom plot shows slightly more functional single variants in NGS than in RF (assuming most single mutants are observed in NGS). Please check and clarify. Fig 3A is the high fraction of the "16,000 potentially active designs" with more than four mutations. This should be validated better if authors wish to report 16,000 functional variants identified in the abstract. First, the authors only report tests of designs up to 8 mutations whereas a substantial fraction of functional NGS variants have more than 8 mutations. Please report the success rate in experimental tests per number of mutations (also related to pt. 4 above on information about the failed designs). Second, please comment on the statistics in supplementary Table S2: The false-positive rate is 1/12, i.e. ~8% are the falsely predicted functional out of the actual non-functional. This is a quite high number since the paper reports ~90% non-functional (Fig. 2C), i.e. with 100,000 non-functional variants, 8,000 are expected to be false positive. This reflects the imbalance in training on a high-prevalence set (most functional) and applying to a low-prevalence set (fewest functional). Third, there is very little indication of the uncertainty in the NGS experiment. It requires a very deep sequencing to cover 100,000 unique variants without paired-end reads, to an extent that warrants calculation of enrichments. To calculate enrichments, the authors need to have a good idea about the abundance of a variant prior to selection and there is no discussion on how this is addressed. Please comment on this, e.g. size the of transformed library, how many cells are expected to have more than one plasmid, FACS coverage (cells sorted per library size), sequencing coverage (average number of reads per unique variant), which region of GFP is sequenced (maximum 600 bases are sequenced), are all functional variants observed in the non-sorted sequencing, are pseudo-counts applied, frequency of unexpected substitutions and how these are handled, etc.

7)
It would be interesting to see some more details on the construction of the neighbourhoods. E.g. a supplementary table listing the sites of the filtered mutations could also list the neighbourhood of each site and the number of calculated multipoint mutations. Are these mostly double mutants or higher order mutants?
8) It would be appropriate to make some quantitative comparison with the previous version of FuncLib (ref. 24), e.g. by the success rates obtained in experimental validations.

9)
It would be interesting to know with a bit more detail on the phylogenetic analysis. The authors write "In this selection step, we keep mutations that are likely to be present in the natural diversity of sequence homologs and that are moreover predicted not to destabilize the protein native state according to atomistic design calculations35". GFP is sometimes considered not to have very many natural homologs and fpbase.org (ref 35) contains a lot of synthetic variants. Please give the number of sequences in the phylogenetic analysis and, if possible, indicate how many of these are natural, e.g. belongs to a reference genome.

10)
The methods section describes "An alternative mutation selection approach that uses Integer Linear Programming" which is only briefly referenced in the text. This should either be removed or the authors should show the results.

11)
In the paragraph starting with "Our working hypothesis is that epistatic interactions most frequently arise from three molecular sources (Supplementary Figure 1)" the third point is unclear and not illustrated in supplementary Fig. S1: "(3) stability-mediated interactions caused by the nonlinear relationship between the free energy of folding and the fraction of natively folded and functional protein".

Minor points
1) It would be helpful if Fig. 1 more directly illustrated what "filtering" and "EpiNNet enrichment" means and where in the pipeline it is performed 2) Should T65S be in supplementary Fig. S2? It would be useful for the discussion in Fig. 4 3) Fig. 4A caption "GFP488/53" should be "GFP488/530"

4)
In methods section under FACS sorting: "E. cloni" should probably be "E. coli", though I quite like the name "cloni"

5)
In the introduction the authors write "and functional multipoint mutants are exceptionally rare", but do not provide a reference to this general statement. Similarly with the statement "Epistasis is a key reason for the low tolerance to multipoint active-site mutations." Reviewer #2 (Remarks to the Author): In this study, Weinstein and colleagues use a combination of energetic modeling and highthroughput screening to identify GFP variants with multiple mutations, addressing the challenge of potential negative epistasis between mutations reducing the hit-rate. htFuncLib was used to design a set of point mutations and then combinations of mutations that were energetically favorable. Then, a machine learning EpiNet model was trained to discriminate favorable and unfavorable combinations of mutations. Hits from this approach included those with > 8 mutations, which exceeded the tolerated mutational perturbation load from previous design approaches. This work and a companion submission on enzyme engineering show that issues with epistasis can begin to be addressed by combining judicious energetic modeling combined with training of machine learning models. An important and relevant study to the protein engineering field. The work is technically sound and clearly presented.
Two comments that should be addressed: (1) There is no functional goal in these libraries -i.e. quantum yield, photostability, color. How would these methods be adapted if a particular functional feature, not just structure and stability were to be optimized. Excitation and emission spectral properties were not described (peak wavelenthgs). Can models be trained to identify what features contribute to photophysical properties?
(2) This training of EpiNNet should be discussed in the context of the choice of host protein -a stable version of GFP -and specifically the work earlier this year from Kondrashov (https://doi.org/10.7554/eLife.75842) showing that mutational landscapes that are flatter are not as useful for training models. Does the energetic modeling in htfuncLib work for more 'fragile' proteins where epistatic interactions can have a more pronounced effect on folding/function? Reviewer #3 (Remarks to the Author): In this manuscript, the authors introduce the htFuncLib, a protein-engineering pipeline to design and test variant libraries focused on protein functional sites. The motivation behind developing such a method is to increase the sampling efficiency and diversity around a protein functional site, which is usually highly conserved and sensitive to mutations. To achieve this goal, the authors have to overcome the epistatic effect by introducing multiple mutations simultaneously, which, in the past, has only been partially achieved by directed evolution through iterative mutationselection cycles. The htFuncLib method starts with low-energy PSSM-approved mutations in the functional site (by phylogenetic analysis and Rosetta energy calculation), then ranks and selects those mutations by their mutual compatibility (by a trained neuron network EpiNNet). After this in silico screening, DNA fragments encoding these compatible point mutations are assembled in an all-against-all combinatory library by Golden Gate method, tested by high-throughput FACS, and read out by deep sequencing. The authors apply this engineering pipeline to GFP's fluorescence functional site. The results are impressive: 1.) they explore a much bigger sequence space that is inaccessible in multiple previous attempts, 2.) the functional multipoint mutants after library selection show desired functional diversity in terms of protein stablity, fluorescence spectra, fluorescence lifetime, pH sensitivity, and fluorescence photo-stability, and 3.) the molecular mechanisms of epistasis underlying the successfully selected GFP variants are interesting for structural analysis. Overall, the manuscript presents a pragmatic way to diversify certain protein functions and I anticipate it will attract attention among protein engineers working towards protein tools (eg. imaging tools such as fluorescent proteins) and enzymes, thus I recommend this manuscript for publication after a minor revision. Below are my comments for the authors: 1. A direct comparison between FuncLib and htFuncLib would be necessary here. If adding a perceptron-based neural network (or ILP) machine learning module largely improves the end results, it would be worthwhile to ask what Rosetta method lacks and what role Rosetta design calculation plays in this new method. 2. The general applicability for other users and other proteins is not very clear. There are several manual steps in the Method description (Line 455,462,470). While it is understandable to introduce manual intervention on every steps during method development and the initial application, I would like the see how the authors plan to automate the pipeline for future applications.
3. The final paragraph in the Introduction is slightly an overstatement (Line 55, "arbitrarily large libraries" and Line 61, "millions(and potentially billions) of designs"). From the Method description, it is obvious that the size of the final library is a limiting factor for designing the combinatory library (Line 462-464, Line 494-497). I would suggest the authors to revise this paragraph to avoid misleading. 4. In Line 81-86 and Supplementary Figure 1, the authors listed three hypothesized sources for epistatic interactions. It is hinted in the text (Line 99, "penalize backbone deformation" and Line 105 "most likely to give rise to…") that the htFuncLib is focused on establishing type 1 epistatic interactions only (this is my speculation). It would make the manuscript easy to understand if the authors can offer a direct correspondence between the three types of epistatic interactions and the htFuncLib library design's target interactions. 5. Following comment 4, I am also confused about how the method deals with backbone movement upon introducing multiple mutations, eg. how does the calculation "penalizes backbone deformation(Line 105)"? 6. In Fig.2D, the overlay of the top-ranked mutations could be better illustrated in a different color. 7. In Fig.3A, "NGS" could be better named as "htFuncLib-NGS". I misunderstood it as all the nextgeneration sequencing data combined (or, does it really mean all the data combined? See, I'm confused.). 8. In Fig.3A, it does not make clear sense to me that the point mutants ("1" in the bottom plot) have a functional ratio of 100% ("1.0" in the top plot) for all the libraries and "RF" prediction. Is it a normalization point? If it is not a normalization point, does it indicate that the fluorescence threshold for defining "functional" is arbitrarily low in this analysis? In addition, since the other reference libraries (avGFP, cgreGFP, …) are sorted differently, I wonder how to justify this comparison of "functional" variants. 9. Fig.4 and Supplementary Fig. 8 are the same low-dimensional representation of protein fitness landscape labeled with "clean" Random Forest(RF) predicted functional mutations and "noisy" experimental data, respectively. The authors choose to focus their analysis and discussion on the RF-predicted results (Fig.4) in the main text. While this is totally reasonable with proper justification (as the authors have provided in line 355-356 for "false negative" and line 885-886 for "false positive"), it should be noted if the representative mutations have strong or weak signals in the experimental data. If they are completely missed in the FACS sorting and NGS sequencing, further experimental validation is needed to support the authors' claim. For example, the discussion on the "two long parallel tails" in line 323-344 is not very convincing to me since the same signals are not apparent in Supplementary Fig. 8A. To keep this part as a novel finding, I would suggest the authors test the representative mutations experimentally. 10. Reference data of transferring mutations to sfGFP are missing (Line 361-362). 11. For the 68 uniques designs(Line 349) chosen for protein purification and biochemistry characterization, how many functional-site mutations do they carry?
12. An open and honest discussion on the limitations of the htFuncLib method would make this manuscript stronger. From several places in the main text, the htFuncLib seems to require a very stable starting point and it cannot explicitly improve a specific aspect of the protein function. I think that general readers will appreciate an open discussion in this regard.
Reviewer #4 (Remarks to the Author): The manuscript describes development and application of the htFuncLib -a computational protein design workflow combining atomistic and machine-learning based approaches. The goal of the htFuncLib is to increase efficiency of laboratory screening efforts by eliminating poorly scoring combinations of mutations from combinatorial libraries, allowing exploration of highly epistatic fitness landscapes. Even for a limited set of manually curated designable positions, an astronomically large number of combinations makes exhaustive sampling of the full sequence space computationally intractable. To optimize amino acid composition for each position the authors first used phylogenetic information and in silico site saturation mutagenesis (SSM) to identify residue types most likely individually tolerable in the parent sequence context. However, as their computations show, a dominant fraction of the designs constructed by random combination of these individually beneficial mutants has substantially worse computational score relative to parental sequence.
The authors propose a simple and elegant computational procedure that apparently helps to alleviate this problem. The entire set of designable positions is split into spatial neighborhoods and the range of allowed amino acids for each position is flexibly adjusted so as to make the total number of sequence combinations for the neighborhood computationally tractable (under 10^6). Rosetta energy score is computed for each combination (or 10% of all combinations for large neighborhoods) in each neighborhood and combinations are classified as "good" or "bad" relative to the score of the parental sequence. Authors train a neural network (EpiNNet) to classify designs, and use the trained network to rank individual mutations in a way reflecting probability of the mutation to be in a "good" scoring combination. The main discovery of the study is that designs constructed from higher ranking mutations ("EpiNNet enriched") have a much higher chance to have better energy scores than designs constructed from a set of mutations filtered using phylogenetic information and single SSM computations.
Authors proceed to apply their workflow to construct a library of PROSS-eGFP -a previously optimized variant of avGFP. Sorting library using FACS indicated presence of the variants retaining fluorescence even with up to 8 mutations. While the functional status of the overwhelming majority of >16,000 variants was assigned based on their enrichment in NGS data, some variants were purified on a scale sufficient for more detailed biophysical analysis.
While the rationale of the method is well laid out and convincing, the findings seem a bit underwhelming.
First, it appears that library construction using high ranking mutations lead to a mostly very conservative set of allowed mutations. Considering this, the result of finding functionally active variants with up to 8 mutations in the active site becomes almost trivial. Authors remark on the inclusion of radical substitutions into the library, but it is not clear how often such mutations appear in the active variants and more importantly how often presence of the radical substitutions affect functionality in a practically significant way. Additionally, previous studies referenced in the manuscript (Somermeyer, 2022) indicated various levels of robustness for different variants of the GFP. avGFP was reported to have intermediate robustness (tolerating at most 4 mutations), but given extensive optimization of PROSS-eGFP it may be not surprising its robustness increased to the level allowing it to tolerate more mutations. It might be helpful to see how the method performs for less optimized proteins.
Second, it would be helpful to see a control experiment where a library is created by combining lowest ranking mutations, or selected positions completely randomized to be able evaluate significance of the sequence space optimization provided by EpiNNet or any other method.
Third, authors claim functional variants having a wide range of biophysical properties (spectral changes, thermostability changes, quantum yield, photostability, life time etc.). Given the multimodality of fluorescent function and its sensitivity to the immediate environment it is not surprising to discover variants have functional diversity. It is exciting to see few variants with greatly increased thermostability or fluorescence lifetime, but on the other hand almost all variants are less photostable making their use limited to specific applications. And yet it is hard to imagine similar results cannot be obtained with other types of diversification as exemplified by multiple examples of directed evolution experiments with fluorescent proteins. Interestingly, it appears none of the characterized variants have substantial changes in emission spectra, which is quite often the most desired feature to be modified.
Fourth, it is not clear how significant is the role of EpiNNet in ranking and selection of the mutations to be allowed in the library. Alternative methods to perform this task were described and applied to optimize enzymes (Fox, 2005 DOI: 10.1016/j.jtbi.2004.11.031; Fox 2007 https://doi.org/10.1038/nbt1286) On a subject of presentation. Lines 204-205 Describe how many libraries were constructed, and were there more than just nohbonds and hbonds libraries. Lines 213-216 Reference 20 used epPCR to construct libraries with no ability to target mutations specifically to chromophore-binding pocket, so there were probably no mutants with >= 5 mutations in the active site Lines 267-269 How conservation score is defined and deltaPSSM is computed? Lines 494-496 Not clear what role EpiNNet plays in selecting mutations for the library. It appears the EpiNNet is used to rank the mutations, but not select them. Therefore, comparison between ILP solution and EpiNNet mutation-selection process is confusing and misleading Line 459 should it be scores greater than -2 ? Lines 462-463 Methods section describing "Refinement and mutation scan" seems to contain text relevant for "Partial modeling and scoring". Lines 114-115 seems to erroneously reference an ILP solution to picking ranked mutations as a method to train a model to perform classification task. Line 584 … and purified using <what?> Line 618 should be pET28 instead of pBAD? Line 619 … restriction sites cloned introduced previously Line 625 BL21(DE3) cells not BL21 cells? Expression from pBAD vector can be done in BL21 cells, pET vectors with T7 promoters require BL21(DE3) cells Figure 3A top panel suggests all # mutations=1 are active in all libraries, which is probably incorrect. Also the color scheme is illegible, please make markers of different shapes to help read the graph. Figure 1C What is the side-bar color gradient illustrating? Page 3 line 105 and page 15 line 469 A unit of distance, presumably $, is missing. There is also inconsistency in font on page 15. Page 3 The definition given in parentheses of proximal positions is incomplete and misleading. Perhaps use definition given in methods which is much more effective.